{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Variable Types\n",
    "\n",
    "A Variable is analogous to a column in a table in a relational database. When creating an Entity, Featuretools will attempt to infer the types of variables present. Featuretools also allows for explicitly specifying the variable types when creating the Entity.\n",
    "\n",
    "**It is important that datasets have appropriately defined variable types when using DFS because this will allow the correct primitives to be used to generate new features.**\n",
    "\n",
    "> Note: When using Dask Entities, users must explicitly specify the variable types for all columns in the Entity dataframe. \n",
    "\n",
    "To understand the different variable types in Featuretools, let's first look at a graph of the variables:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "from featuretools.variable_types import graph_variable_types\n",
    "graph_variable_types()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, there are multiple variable types and some have subclassed variable types. For example, ZIPCode is variable type that is child of Categorical type which is a child of Discrete type. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's explore some of the variable types and understand them in detail."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Discrete\n",
    "\n",
    "A Discrete variable type can only take certain values. It is a type of data that can be counted, but cannot be measured. If it can be classified into distinct buckets, then it a discrete variable type. \n",
    "\n",
    "There are 2 sub-variable types of Discrete. These are Categorical, and Ordinal. If the data has a certain ordering, it is of Ordinal type. If it cannot be ordered, then is a Categorical type. \n",
    "\n",
    "### Categorical \n",
    "\n",
    "A Categorical variable type can take unordered discrete values. It is usually a limited, and fixed number of possible values. Categorical variable types can be represented as strings, or integers. \n",
    "\n",
    "Some examples of Categorical variable types:\n",
    "\n",
    "- Gender\n",
    "- Eye Color\n",
    "- Nationality\n",
    "- Hair Color\n",
    "- Spoken Language\n",
    "\n",
    "### Ordinal\n",
    "\n",
    "A Ordinal variable type can take ordered discrete values. Similar to Categorical, it is usually a limited, and fixed number of possible values. However, these discrete values have a certain order, and the ordering is important to understanding the values. Ordinal variable types can be represented as strings, or integers. \n",
    "\n",
    "Some examples of Ordinal variable types:\n",
    "\n",
    "- Educational Background (Elementary, High School, Undergraduate, Graduate)\n",
    "\n",
    "- Satisfaction Rating (“Not Satisfied”, “Satisfied\", “Very Satisfied”)\n",
    "\n",
    "- Spicy Level (Hot, Hotter, Hottest)\n",
    "\n",
    "- Student Grade (A, B, C, D, F)\n",
    "\n",
    "- Size (small, medium, large)\n",
    "\n",
    "\n",
    "#### Categorical SubTypes (CountryCode, Id, SubRegionCode, ZIPCode)\n",
    "\n",
    "There are also more distinctions within the Categorical variable type. These include CountryCode, Id, SubRegionCode, and ZIPCode.\n",
    "\n",
    "It is important to make this distinction because there are certain operations that can be applied, but they don't necessary apply to all Categorical types. For example, there could be a [custom primitive](https://docs.featuretools.com/en/stable/automated_feature_engineering/primitives.html#defining-custom-primitives) that applies to the ZIPCode variable type. It could extract the first 5 digits of a ZIPCode. However, this operation is not valid for all Categorical variable types. Therefore it is approriate to use the ZIPCode variable type. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Datetime\n",
    "A Datetime is a representation of a date and/or time. Datetime variable types can be represented as strings, or integers. However, they should be in a intrepretable format or properly cast before using DFS. \n",
    "\n",
    "Some examples of Datetime include:\n",
    "\n",
    "- transaction time\n",
    "- flight departure time\n",
    "- pickup time\n",
    "\n",
    "### DateOfBirth\n",
    "A more distinct type of datetime is a DateOfBirth. This is an important distinction because it allows additional primitives to be applied to the data to generate new features. For example, having an DateOfBirth variable type, will allow the Age primitive to be applied during DFS, and lead to a new Numeric feature."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Text\n",
    "Text is a long-form string, that can be of any length. It is commonly used with NLP operations, such as TF-IDF. Featuretools supports NLP operations with the nlp-primitives [add-on](https://innovation.alteryx.com/natural-language-processing-featuretools/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## LatLong\n",
    "A LatLong represents an ordered pair (Latitude, Longitude) that tells the location on Earth. The order of the tuple is important. LatLongs can be represented as tuple of floating point numbers. \n",
    "\n",
    "To make a LatLong in a dataframe do the following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data = pd.DataFrame()\n",
    "data['latitude'] = [51.52, 9.93, 37.38]\n",
    "data['longitude'] = [-0.17, 76.25, -122.08]\n",
    "data['latlong'] = data[['latitude', 'longitude']].apply(tuple, axis=1)\n",
    "data['latlong']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## List of Variable Types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also get all the variable types as a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from featuretools.variable_types import list_variable_types\n",
    "list_variable_types()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
